On the convergence, lock-in probability and
sample complexity of stochastic approximation



Sameer Kamal



Abstract: It is shown that under standard hypotheses, if stochastic
approximation iterates remain tight, they converge with probability one to
what their o.d.e. limit suggests. A simple test for tightness (and therefore
a.s. convergence) is provided. Further, estimates on lock-in probability,
i.e., the probability of convergence to a specific attractor of the o.d.e.
limit given that the iterates visit its domain of attraction, and sample
complexity, i.e., the number of steps needed to be within a prescribed
neighborhood of the desired limit set with a prescribed probability, are also
provided. The latter improve significantly upon existing results in that they
require a much weaker condition on the martingale difference noise.



Key words: stochastic approximation, tightness of iterates, almost sure 
convergence, lock-in probability, sample complexity 

1 Introduction 

Stochastic approximation was originally introduced in [H] as a scheme for 
finding zeros of a nonlinear function under noisy measurements. It has since 
become one of the main workhorses of statistical computation, signal
processing, adaptive schemes in AI and economic models, etc. See [1], [3], [1],
[0], [7] for some recent texts that give an extensive account. One of the
successful approaches for its convergence analysis has been the 'o.d.e. approach'
of [5], [8] which treats it as a noisy discretization of an ordinary differential 
equation (o.d.e.) with slowly decreasing step sizes. 

The main contributions of this paper are as follows. The first contribution 
concerns convergence properties. The aforementioned convergence analysis 
is usually of the form: if the iterates remain a.s. bounded, then they converge 
a.s. to a set predicted by the o.d.e. analysis. This a.s. boundedness usually 
has to be established separately. Here we make the simple observation that 
under standard (i.e., commonly assumed) conditions, the tightness of iterates 
suffices for a.s. convergence to the set predicted by the o.d.e. analysis. A 
simple test for tightness is also provided. 

Our second contribution concerns the lock-in probability, defined as the 
probability of convergence to a specific attractor of the o.d.e. if the iter- 
ates enter its domain of attraction. Under the aforementioned standard as- 
sumptions, an estimate for this is given in [3], Chapter 4, p. 37, using the 
Burkholder inequalities. This has been improved to a much stronger estimate 
in ibid., p. 41, under the strong hypothesis that suitably re-scaled
martingale difference noise remains bounded. Adapted from [2], these results
suffice for the application they were intended for, viz., reinforcement
learning algorithms, but are inadequate for other applications where such a boundedness
hypothesis may be untenable. We recover these results under a much weaker 
condition that only requires the re-scaled martingale differences to have an 
exponentially decaying conditional tail probability. Further, we feel that the 
methodology developed in our proof is of broader applicability and might 
prove useful in other situations. 

A third contribution concerns sample complexity. Originating in the sta- 
tistical learning theory literature, this notion refers to the number of samples 
needed to be within a given precision of the goal with a given probability. 
This literature, however, usually deals with i.i.d. input-output pairs. Here 
we have a recursive scheme for which we expect the result to depend upon 
the initial position at iterate $n_0$ (say). Furthermore, the estimate is of an
asymptotic nature, which requires this $n_0$ to be 'large enough' (so that the
decreasing step size has decreased sufficiently). Under the 'strong' condition
of [2], this was done in [2] (see also [3], p. 42). We improve on this by
extending the result to the 'exponential tail' case mentioned in the previous
paragraph. This, however, is a direct spin-off of the lock-in probability
estimate and follows essentially as in [2]. As a source of some previous results
on exponential bounds in stochastic approximation we point out §6 in the
survey article [ID], and the literature cited therein.

We prove our 'tightness implies convergence' result in section 3 following 
notational and other preliminaries in section 2. The simple sufficient condi- 
tion for tightness is given in section 4. Section 5, the longest, is devoted to 
deriving the lock-in probability estimate from which the sample complexity 
result of section 6 follows easily. 

2 Preliminaries 

Consider the $\mathbb{R}^d$-valued stochastic approximation iterates

$$x_{n+1} = x_n + a(n)\,[h(x_n) + M_{n+1}], \tag{1}$$

and their 'o.d.e.' limit

$$\dot{x}(t) = h(x(t)). \tag{2}$$

We make the following assumptions regarding $h(\cdot)$, $a(n)$, and $M_{n+1}$:

(A1) $h(\cdot) : \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz. Thus

$$\|h(x) - h(y)\| \le c\,\|x - y\| \quad \text{for some } 0 < c < \infty.$$

(A2) The step sizes $a(n)$ are positive reals and satisfy

(i) $\sum_n a(n) = \infty$,
(ii) $\sum_n a(n)^2 < \infty$, and
(iii) $\exists\, c > 0$ such that $a(n) \le c\,a(m)$ $\forall n \ge m$.

(A3) $(M_n)$ is a martingale difference sequence w.r.t. the filtration $(\mathcal{F}_n)$ where
$\mathcal{F}_n = \sigma(x_0, M_1, \ldots, M_n)$. Thus, $E[M_{n+1} \mid \mathcal{F}_n] = 0$ a.s. for all $n \ge 0$.
Moreover, $M_n$ is square integrable for all $n \ge 0$ with

$$E[\|M_{n+1}\|^2 \mid \mathcal{F}_n] \le c\,(1 + \|x_n\|^2) \tag{3}$$

a.s. for some $0 < c < \infty$.
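As an illustration of the recursion (1) under assumptions of the type (A1)-(A3), here is a minimal Python sketch. The drift $h(x) = -x$, the step sizes $a(n) = 1/(n+1)$, the Gaussian noise and the function name `run_sa` are all illustrative assumptions, not taken from the paper. With these choices $h$ is Lipschitz, $\sum_n a(n) = \infty$, $\sum_n a(n)^2 < \infty$, and the noise is a martingale difference sequence with bounded conditional variance.

```python
import random

def run_sa(h, x0, n_steps, noise_std=1.0, seed=0):
    """Run the scalar recursion x_{n+1} = x_n + a(n) * (h(x_n) + M_{n+1})."""
    rng = random.Random(seed)
    x = x0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)                  # assumed step sizes a(n) = 1/(n+1)
        m_next = rng.gauss(0.0, noise_std)   # i.i.d. zero-mean noise (assumption)
        x = x + a_n * (h(x) + m_next)
    return x

# Assumed drift h(x) = -x; the o.d.e. x' = -x has the single attractor {0}.
x_final = run_sa(lambda x: -x, x0=5.0, n_steps=50_000)
```

For this particular choice the iterate after $n$ steps is the running average of the noise samples, so `x_final` should lie close to the equilibrium $0$ of the limiting o.d.e.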



We next describe the setting for our problem. Let $V : \mathbb{R}^d \to [0, \infty)$
be a differentiable, nonnegative potential or 'Liapunov function' satisfying
$\lim_{\|x\| \to \infty} V(x) = \infty$ and $\dot{V}(x) := \nabla V(x) \cdot h(x) \le 0$ $\forall x$. Define
$H := \{x : \nabla V(x) \cdot h(x) = 0\}$ and assume that this coincides with $\{x : V(x) = 0\}$. Note
that $H$ is compact. Under these assumptions, $H$ is an asymptotically stable,
positively invariant set of the limiting o.d.e. (2). Let $B$ be an arbitrary
bounded open set such that $H \subset B$. Consider the convergence probability
$P[x_n \to H \mid x_{n_0} \in B]$ for some $n_0$. By Theorem 8 of [3], p. 37, under
assumptions (A1)-(A3) the convergence probability satisfies

$$P[x_n \to H \mid x_{n_0} \in B] \to 1 \quad \text{as } n_0 \to \infty. \tag{4}$$

The convergence results of our paper are as follows: 

• If the iterates $\{x_n\}$ are tight and (4) holds then the iterates will
converge to $H$ with probability 1.

• If the Liapunov function grows exactly quadratically outside a compact 
set, then the iterates {xn} are tight. 

Combining the two, if the Liapunov function grows exactly quadratically 
outside a compact set and assumptions (A1)-(A3) hold, then the iterates will
converge to H almost surely. 

One is often interested in the 'lock-in' probability of a specific attractor,
denoted $H$ again by abuse of notation, of the limiting o.d.e. (2), i.e., the
probability of convergence to $H$ given that the iterates $\{x_n\}$ land up in its
domain of attraction after sufficiently long time. In this spirit, Theorem
8 of [3], p. 37, shows that $P[x_n \to H \mid x_{n_0} \in B] = 1 - O(b(n_0))$, where
$b(n_0) := \sum_{m=n_0}^{\infty} a(m)^2$ and $B$ is a bounded open set contained in the domain
of attraction of $H$. In this paper we give the following stronger results:

• Assuming that the scaled martingale difference $\|M_{n+1}\|/(1 + \|x_n\|)$ has
exponentially decaying conditional tail probability, we show that

$$P[x_n \to H \mid x_{n_0} \in B] = 1 - O\!\left(e^{-c/\sqrt[3]{b(n_0)}}\right) \quad \text{as } n_0 \to \infty.$$

• As a corollary to the above result we also state a sample complexity
result wherein the 'probability of error' is $O\!\left(e^{-c/\sqrt[3]{b(n_0)}}\right)$.



Similar results are proved in [3], pp. 38-41, but under a much stronger
hypothesis, viz., that the scaled martingale difference sequence above is in 
fact bounded. This is too restrictive for many applications. 

Finally, before we start our calculations, a remark on notation: in what 
follows the letter c may denote a different constant in different lines. A 
similar remark applies to the letters $c_1$ and $c_2$ too.

3 Convergence for tight iterates 

In this section we relate tightness of the iterates to their almost sure
convergence to $H$. Recall that the iterates $\{x_n\}$ are tight if given an arbitrary
$\epsilon > 0$, there exists a compact set $K \subset \mathbb{R}^d$ such that

$$P[x_n \in K] > 1 - \epsilon \quad \forall n.$$

Theorem 1. Assume that the iterates $\{x_n\}$ are tight and (4) holds for any
bounded open set $B$ containing $H$. Then almost surely $x_n \to H$ as $n \to \infty$.

Proof. Pick an arbitrary $\epsilon > 0$. Because of tightness there exists a compact
set $K$ such that

$$P[x_n \in K] > 1 - \epsilon \quad \forall n.$$

Now choose a bounded open set $B$ such that $K, H \subset B$. Clearly

$$P[x_n \in B] > 1 - \epsilon \quad \forall n.$$

Also, by assumption, we have

$$P[x_n \to H \mid x_{n_0} \in B] \to 1 \quad \text{as } n_0 \to \infty.$$

Combining the two we get

$$P[x_n \to H] \ge P[x_{n_0} \in B]\; P[x_n \to H \mid x_{n_0} \in B] \ge (1 - \epsilon)\, P[x_n \to H \mid x_{n_0} \in B].$$

The left hand side above is independent of $n_0$. Therefore, letting $n_0 \to \infty$ in
the right hand side we get

$$P[x_n \to H] \ge 1 - \epsilon.$$

But $\epsilon$ itself was arbitrary. It follows that

$$P[x_n \to H] = 1.$$ □

Corollary 2. Under assumptions (A1)-(A3), if the iterates $\{x_n\}$ are tight
then $x_n \to H$ a.s.

Proof. This is immediate from Theorem 1 and the fact that (4) holds under
assumptions (A1)-(A3) by Theorem 8 of [3], p. 37. □

4 A condition for tightness 

In this section we show that if the Liapunov function grows 'exactly' quadrat- 
ically outside some compact set then the iterates are tight. More precisely, 
we assume that the Liapunov function V{-) satisfies the following: 

(A4) $V(\cdot)$ is twice differentiable and all second order derivatives are bounded
by some constant. Thus, $|\partial_i \partial_j V(x)| \le c$ for all $i, j$ and $x$.

(A5) $\|x\|^2 \le c\,(1 + V(x))$ for all $x$ and some $0 < c < \infty$.

Theorem 3. Under (A4), (A5) the iterates $\{x_n\}$ are tight.

Proof. Without loss of generality, let $E[V(x_0)] < \infty$. Consider (1), the
equation for the iterates. Doing a Taylor expansion and using the fact that
the second order derivatives of $V$ are bounded, we get

$$V(x_{n+1}) \le V(x_n) + a(n)\,\nabla V(x_n) \cdot [h(x_n) + M_{n+1}] + c\,a(n)^2\,\|h(x_n) + M_{n+1}\|^2.$$

Since $\nabla V(x_n) \cdot h(x_n) \le 0$, this yields

$$V(x_{n+1}) \le V(x_n) + a(n)\,\nabla V(x_n) \cdot M_{n+1} + c\,a(n)^2\,\|h(x_n) + M_{n+1}\|^2.$$

Lipschitz continuity of $h(\cdot)$ gives us the following bound

$$\|h(x_n) + M_{n+1}\|^2 = \|h(x_n)\|^2 + \|M_{n+1}\|^2 + 2\,h(x_n) \cdot M_{n+1} \le c\,(1 + \|x_n\|^2) + \|M_{n+1}\|^2 + 2\,h(x_n) \cdot M_{n+1}.$$

This leads to

$$V(x_{n+1}) \le V(x_n) + a(n)\,\nabla V(x_n) \cdot M_{n+1} + c\,a(n)^2 \left[(1 + \|x_n\|^2) + \|M_{n+1}\|^2 + 2\,h(x_n) \cdot M_{n+1}\right].$$

Taking conditional expectation and using (3) gives

$$E[V(x_{n+1}) \mid \mathcal{F}_n] \le V(x_n) + c\,a(n)^2\,(1 + \|x_n\|^2).$$

By (A5), this can be written as

$$E[V(x_{n+1}) \mid \mathcal{F}_n] \le V(x_n) + c\,a(n)^2\,(1 + V(x_n)).$$

Taking expectations we get

$$E[V(x_{n+1})] \le E[V(x_n)] + c\,a(n)^2\,(1 + E[V(x_n)]).$$

This gives

$$1 + E[V(x_{n+1})] \le 1 + E[V(x_n)] + c\,a(n)^2\,(1 + E[V(x_n)])$$
$$= (1 + c\,a(n)^2)\,(1 + E[V(x_n)])$$
$$\le \exp(c\,a(n)^2)\,(1 + E[V(x_n)])$$
$$\le \exp\Big(c \sum_{i=0}^{n} a(i)^2\Big)\,(1 + E[V(x_0)]).$$

Since $\sum_i a(i)^2 < \infty$, the quantity $1 + E[V(x_{n+1})]$ is bounded by a constant
independent of $n$. By (A5), $E[\|x_n\|^2]$ is then bounded uniformly in $n$, and
Chebyshev's inequality shows that the iterates are tight. □

Corollary 4. Under assumptions (A1)-(A5), we have $x_n \to H$ a.s.

Proof. This follows from Theorem 1, Theorem 3, and the fact that (4) holds
under assumptions (A1)-(A3) by Theorem 8 of [3], p. 37. □
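The last chain of inequalities in the proof of Theorem 3 can be checked numerically. The sketch below iterates the worst case $u_{n+1} = (1 + c\,a(n)^2)\,u_n$, where $u_n$ stands for $1 + E[V(x_n)]$, and compares it with the exponential bound. The step sizes $a(n) = 1/(n+1)$, the constant $c = 3$, the starting value and the function name are illustrative assumptions, not values from the paper.

```python
import math

def expected_potential_bound(u0, c, n_steps):
    """Iterate u_{n+1} = (1 + c a(n)^2) u_n with a(n) = 1/(n+1).

    Returns the final u together with exp(c * sum_i a(i)^2) * u0,
    the bound used in the proof of Theorem 3.
    """
    u, sq_sum = u0, 0.0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)
        u *= 1.0 + c * a_n ** 2
        sq_sum += a_n ** 2
    return u, math.exp(c * sq_sum) * u0

u, bound = expected_potential_bound(u0=2.0, c=3.0, n_steps=10_000)
# For a(n) = 1/(n+1) the full sum of squares equals pi^2/6, giving a bound
# that does not depend on the number of steps.
uniform_bound = math.exp(3.0 * math.pi ** 2 / 6) * 2.0
```

Because the bound does not grow with `n_steps`, the same constant controls $E[V(x_n)]$ for every $n$, which is what the tightness argument needs.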



5 Lock-in probability 

In this section we give a lower bound for $P[x_n \to H \mid x_{n_0} \in B]$ in terms of
$b(n_0)$ when $n_0$ is sufficiently large. How large $n_0$ needs to be will depend on
the choice of $B$, among other things. Before we proceed further we fix some
notation and recall some known results.

Choose an arbitrary finite $T$ from the interval $(0, \infty)$ and hold it fixed
for the rest of the analysis. Let $t(n) = \sum_{i=0}^{n-1} a(i)$. For $n_0 \ge 0$, let
$n_i = \min\{n : t(n) \ge t(n_{i-1}) + T\}$. Define $\bar{x}(t)$ by: $\bar{x}(t(n)) = x_n$, with linear
interpolation on $[t(n), t(n+1)]$ for all $n$. Let $x^{n_i}(\cdot)$ be the solution of the
limiting o.d.e. (2) on $[t(n_i), t(n_{i+1}))$ with the initial condition
$x^{n_i}(t(n_i)) = \bar{x}(t(n_i)) = x_{n_i}$. Let

$$\rho_i := \sup_{t \in [t(n_i),\, t(n_{i+1}))} \|\bar{x}(t) - x^{n_i}(t)\|.$$

We recall here a few results from [3]. As shown there ([3], pp. 32-33),
there exists a $\delta_B > 0$ such that if $x_{n_i} \in B$ and $\rho_i < \delta_B$ then $x_{n_{i+1}} \in B$, too.
It is also known ([3], section 2.1, p. 16) that if the sequence of iterates $\{x_n\}$
remains bounded almost surely on a prescribed set of sample points, then
it converges almost surely on this set to $H$. Combining the two facts gives
us the following estimate on the probability of convergence, conditioned on
$x_{n_0} \in B$ ([3], Lemma 1, p. 33)

$$P[\bar{x}(t) \to H \mid x_{n_0} \in B] \ge P[\rho_i < \delta_B \ \forall i \ge 0 \mid x_{n_0} \in B].$$

Let $\mathcal{B}_i$ denote the event that $x_{n_0} \in B$ and $\rho_k < \delta_B$ for $k = 0, 1, \ldots, i$. We get
the following lower bound for the above probability ([3], Lemma 2, p. 33)

$$P[\rho_i < \delta_B \ \forall i \ge 0 \mid x_{n_0} \in B] \ge 1 - \sum_{i=0}^{\infty} P[\rho_i \ge \delta_B \mid \mathcal{B}_{i-1}].$$

For $n_0$ sufficiently large, this in turn can be bounded as

$$P[\rho_i \ge \delta_B \mid \mathcal{B}_{i-1}] \le P\left[\max_{0 \le j \le n_{i+1} - n_i} \Big\| \sum_{m=0}^{j-1} a(n_i + m)\, M_{n_i+m+1} \Big\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right],$$

where $\delta = \delta_B / (2 K_T)$, with $K_T$ being a constant that depends only on $T$ ([3],
Lemma 3, p. 34).

Thus the probability of convergence, $P[x_n \to H \mid x_{n_0} \in B]$, is lower bounded
by the following expression

$$1 - \sum_{i=0}^{\infty} P\left[\max_{0 \le j \le n_{i+1} - n_i} \Big\| \sum_{m=0}^{j-1} a(n_i + m)\, M_{n_i+m+1} \Big\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right].$$
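The checkpoint indices $n_i$ used in this section can be computed directly from the step sizes. The following Python sketch assumes the convention $t(n+1) - t(n) = a(n)$ and the illustrative choices $a(n) = 1/(n+1)$, $T = 1$, $n_0 = 100$. The function name `checkpoints` is hypothetical and does not come from the paper.

```python
def checkpoints(a, n0, T, count):
    """Return ([n_0, ..., n_count], window lengths t(n_{i+1}) - t(n_i)).

    Each n_{i+1} is the first index n with t(n) >= t(n_i) + T,
    where t(n + 1) - t(n) = a(n).
    """
    idx, lengths = [n0], []
    n = n0
    for _ in range(count):
        elapsed = 0.0
        while elapsed < T:
            elapsed += a(n)   # advance the algorithmic clock by one step size
            n += 1
        idx.append(n)
        lengths.append(elapsed)
    return idx, lengths

def a(n):
    return 1.0 / (n + 1)      # assumed step sizes

ns, window_lengths = checkpoints(a, n0=100, T=1.0, count=5)
```

Every window covers at least $T$ units of algorithmic time and overshoots by less than one step size, which is why sums of step sizes over a window are bounded by $T + a_{\max}$.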



In this section we show that $1 - P[x_n \to H \mid x_{n_0} \in B]$, or the 'error
probability', decays exponentially in $1/\sqrt[3]{b(n_0)}$ provided the scaled martingale
difference terms, $\|M_{i+1}\|/(1 + \|x_i\|)$, have exponentially decaying conditional
tail probability. Specifically, we assume that

$$P\left[\frac{\|M_{i+1}\|}{1 + \|x_i\|} > v \;\Big|\; \mathcal{F}_i\right] \le c_1 \exp(-c_2 v) \tag{5}$$

for $v$ large enough and for $c_1$ and $c_2$ some positive constants.

Before we move on to our analysis we introduce a step size assumption 
that significantly simplifies our calculations and which we shall assume for 
the remainder of this section. 
5.1 A step size assumption

We assume that the step sizes $a(i)$ decrease only in 'Lipschitz' fashion. By
this we mean that there is a positive constant $\gamma_T$ depending only on $T$ such
that if $a(n_i + m_1)$ and $a(n_i + m_2)$ are two arbitrary time steps from the same
interval $[t(n_i), t(n_{i+1}))$ then

$$\frac{a(n_i + m_1)}{a(n_i + m_2)} \le \gamma_T. \tag{6}$$

Define $a_{\max} := \sup_n a(n)$. Since $\sum_n a(n)^2 < \infty$, it follows that $a_{\max} < \infty$.
The next lemma shows that (6) holds for a large class of step sizes.

Lemma 5. Consider step sizes of the form

$$a(n) = \frac{1}{n^{\alpha} (\log n)^{\beta}},$$

where either $\alpha \in (1/2, 1)$ or $\alpha = 1$, $\beta \le 0$. For such step sizes there exists a
positive constant $\gamma_T$ depending only on $T$ such that two arbitrary time steps
from the same interval $[t(n_i), t(n_{i+1}))$ satisfy (6).

Proof. We need to show that for $n_1 < n_2$, if $a(n_1) + \cdots + a(n_2) < T + a_{\max}$
then for $n_1$ sufficiently large, there exists a constant $\gamma_T$, depending only on
$T$, such that $a(n_1)/a(n_2) < \gamma_T$. Since $\int_{n_1}^{n_2} a(s)\,ds \le a(n_1) + \cdots + a(n_2)$, it
suffices to show that there exists a constant $\gamma_T$ such that, for $n_1$ sufficiently
large, $\int_{n_1}^{n_2} a(s)\,ds < T + a_{\max}$ implies $a(n_1)/a(n_2) < \gamma_T$. We consider the two
cases separately.

• $\alpha \in (1/2, 1)$.

The result follows easily from the following two inequalities which hold
for $n_1$ sufficiently large

$$\int_{n_1}^{n_2} \frac{ds}{s^{\alpha} (\log s)^{\beta}} \ge \int_{n_1}^{n_2} \frac{ds}{s \log s} = \log\left(\frac{\log n_2}{\log n_1}\right),$$

and 

where a < v < 1. 



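• $\alpha = 1$, $\beta \le 0$.

The result follows easily from the following inequality

$$\int_{n_1}^{n_2} \frac{ds}{s (\log s)^{\beta}} \ge \int_{n_1}^{n_2} \frac{ds}{s} = \log(n_2/n_1).$$ □

For $0 \le m \le n_{i+1} - n_i - 1$, the step size assumption implies

$$\frac{a(n_i)}{\gamma_T} \le a(n_i + m) \le \gamma_T\, a(n_i).$$

As a result, since the step sizes in one interval sum to a value in $[T, T + a_{\max})$,

$$\frac{T}{\gamma_T\, a(n_i)} \le n_{i+1} - n_i \le \frac{\gamma_T\,(T + a_{\max})}{a(n_i)},$$

whereby

$$\frac{T}{\gamma_T^3}\, a(n_i) \le \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)^2 \le \gamma_T^3\,(T + a_{\max})\, a(n_i),$$

and

$$\sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)^2 = \Theta(a(n_i)),$$

$$b(n_0) = \sum_{i=0}^{\infty} \left(\sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)^2\right) = \Theta\left(\sum_{i=0}^{\infty} a(n_i)\right).$$

Remark 6. By virtue of the first of the above two equations, we can use the
notationally simpler term $a(n_i)$ as a proxy for the sum $\sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)^2$
for obtaining order estimates. Indeed, in the remainder of this paper we shall
repeatedly do so.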
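The order estimate behind Remark 6 can be illustrated numerically. The sketch below uses the assumed family $a(n) = 1/(n+1)^{\alpha}$ (a shifted version of the step sizes in Lemma 5 with $\beta = 0$, chosen to avoid $n = 0$) and reports, for each window, the largest step size ratio and the ratio of the windowed sum of squares to $a(n_i)$. The function name and parameter values are illustrative assumptions.

```python
import math

def window_stats(alpha, n0, T, count):
    """For a(n) = 1/(n+1)^alpha, return per-window pairs
    (a(n_i) / a(n_{i+1} - 1), sum_m a(n_i + m)^2 / a(n_i))."""
    a = lambda n: 1.0 / (n + 1) ** alpha
    stats, lo = [], n0
    for _ in range(count):
        hi, elapsed = lo, 0.0
        while elapsed < T:             # window ends once T time units elapse
            elapsed += a(hi)
            hi += 1
        ratio = a(lo) / a(hi - 1)      # largest ratio in the window (a decreases)
        sq_over_a = sum(a(n) ** 2 for n in range(lo, hi)) / a(lo)
        stats.append((ratio, sq_over_a))
        lo = hi
    return stats

T = 1.0
stats_harmonic = window_stats(alpha=1.0, n0=50, T=T, count=6)
stats_slow = window_stats(alpha=0.75, n0=50, T=T, count=6)
```

In both cases the second entry stays between $T$ divided by the window ratio and $T + 1$, so the windowed sum of squares is of the same order as $a(n_i)$.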

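5.2 Bounding the error probability

It will be notationally convenient at this point to introduce $\zeta_{n_i+j}$ to denote,
for an arbitrary $i$, the martingale with indexing starting at $n_i$ defined as

$$\zeta_{n_i+j} := \begin{cases} 0 & \text{if } j = 0, \\ \sum_{m=0}^{j-1} a(n_i + m)\, M_{n_i+m+1} & \text{if } 0 < j \le n_{i+1} - n_i. \end{cases}$$

Recall that we seek a bound for $1 - \sum_i P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \mid \mathcal{B}_{i-1}\right]$.
As a first step we bound the following single term

$$P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right].$$

Our analysis for deriving a bound requires first suitably stopping the
martingale $\zeta_{n_i+j}$, then projecting the stopped martingale onto a coordinate
axis to obtain an $\mathbb{R}$-valued martingale, and finally truncating the difference
terms for this martingale.

Define the stopping time

$$\tau := \inf\left\{n_i + j : \|\zeta_{n_i+j}\|_{\infty} > \delta/\sqrt{d}\right\} \wedge n_{i+1}.$$

Let $\zeta^{\tau}_{n_i+m}$ denote the stopped martingale $\zeta_{(n_i+m) \wedge \tau}$. Similarly, let $M^{\tau}_{n_i+m+1}$
denote $M_{(n_i+m+1) \wedge \tau}$. We can write

$$\zeta^{\tau}_{n_i+j} = \sum_{m=0}^{j-1} a(n_i + m)\, M^{\tau}_{n_i+m+1}.$$

Let $\mathcal{P}_z(\cdot)$ denote the projection operator projecting onto the $z$-th
coordinate. Note that

$$P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right] \le P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\|_{\infty} > \delta/\sqrt{d} \;\Big|\; \mathcal{B}_{i-1}\right]$$
$$\le \sum_{z=1}^{d} P\left[\left|\mathcal{P}_z\big(\zeta^{\tau}_{n_{i+1}}\big)\right| > \delta/\sqrt{d} \;\Big|\; \mathcal{B}_{i-1}\right]. \tag{7}$$

We'll show that for $n_0$ sufficiently large the following bound holds.

$$P\left[\left|\mathcal{P}_z\big(\zeta^{\tau}_{n_{i+1}}\big)\right| > \delta/\sqrt{d} \;\Big|\; \mathcal{B}_{i-1}\right] \le c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{a(n_i)}}\right).$$

To derive this bound we'll need a truncated copy of $\mathcal{P}_z(M^{\tau}_{n_i+m+1})$. Define
$N_{n_i+m+1}$ as follows

$$N_{n_i+m+1} := \begin{cases} \mathcal{P}_z(M^{\tau}_{n_i+m+1}) & \text{if } |\mathcal{P}_z(M^{\tau}_{n_i+m+1})| \le v, \\ \operatorname{sgn}\big(\mathcal{P}_z(M^{\tau}_{n_i+m+1})\big)\, v & \text{otherwise,} \end{cases}$$

and define $\eta$ as

$$\eta := \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, N_{n_i+m+1}.$$

Note that

$$P\left[\left|\mathcal{P}_z\big(\zeta^{\tau}_{n_{i+1}}\big)\right| > \delta/\sqrt{d} \;\Big|\; \mathcal{B}_{i-1}\right] \le P\left[|\eta| > \delta/\sqrt{d} \;\Big|\; \mathcal{B}_{i-1}\right]$$
$$+\; P\left[\exists\, m < n_{i+1} - n_i : \mathcal{P}_z(M^{\tau}_{n_i+m+1}) \ne N_{n_i+m+1} \;\Big|\; \mathcal{B}_{i-1}\right]. \tag{8}$$

To calculate bounds for the last two terms of (8) we'll need a bound for
the tail probability $P\left[|\mathcal{P}_z(M^{\tau}_{n_i+m+1})| > u \mid \mathcal{B}_{i-1}\right]$. Let $x^{\tau-1}_{n_i+m}(\cdot)$ denote the
following $\mathcal{F}_{n_i+m}$-measurable function

$$x^{\tau-1}_{n_i+m}(\omega) := x_{(n_i+m) \wedge (\tau(\omega)-1)}(\omega).$$

In order to get a good bound we first show that for all $i$, and $m \le n_{i+1} - n_i$,
conditional on $\mathcal{B}_{i-1}$, $\|x^{\tau-1}_{n_i+m}\|$ is bounded by a constant.
We shall, therefore, successively get bounds for

1. $\|x^{\tau-1}_{n_i+m}(\cdot)\|$

2. $P\left[|\mathcal{P}_z(M^{\tau}_{n_i+m+1})| > u \mid \mathcal{B}_{i-1}\right]$

3. $P\left[|\eta| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right]$

4. $P\left[|\mathcal{P}_z(\zeta^{\tau}_{n_{i+1}})| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right]$

5. $\sum_i P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \mid \mathcal{B}_{i-1}\right]$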
5.2.1 Bound for $\|x^{\tau-1}_{n_i+m}(\cdot)\|$

Recall that there exists a suitable positive $\delta_B$ such that if $x_{n_i} \in B$ and $\rho_i < \delta_B$
then $x_{n_{i+1}} \in B$, too. It follows that conditional on $\mathcal{B}_{i-1}$ we must have
$x_{n_j} \in B$ for $j = 0, 1, \ldots, i$; in particular, $x_{n_i} \in B$. Define $K_0 := \sup_{x \in B} \|x\|$.
Thus, whatever be the $i$, conditional on $\mathcal{B}_{i-1}$ we must have

$$\|x_{n_i}\| \le K_0.$$

We next show that if $\|x_{n_i}\| \le K_0$ then there exists an $N$ independent
of $i$ such that $\|x^{\tau-1}_{n_i+m}\| < N$ for all $m \le n_{i+1} - n_i$. As $m$ increases, if
$\|x^{\tau-1}_{n_i+m}\|$ is unbounded, then it has to sequentially cross each one of the
values $K_0, K_0 + 1, \ldots, K_0 + n, \ldots$. We will show that for a fixed, finite $T$ this
is not possible. Indeed, we'll show that there exists a suitable $N$ such that
$\|x^{\tau-1}_{n_i+m}\| < N$ for all $m \le n_{i+1} - n_i$ where $N$ does not depend on $i$. Our
proof will use the fact that the sum $\sum_{k=K_0}^{q} 1/k$ diverges as $q \to \infty$.

For $0 \le m_1 < m_2 \le n_{i+1} - n_i - 1$ we have

$$x_{n_i+m_2} = x_{n_i+m_1} + \sum_{j=n_i+m_1}^{n_i+m_2-1} a(j)\, h(x_j) + \sum_{j=n_i+m_1}^{n_i+m_2-1} a(j)\, M_{j+1}. \tag{9}$$

Let $M^{\tau-1}_{n_i+m}(\cdot)$ denote $M_{(n_i+m) \wedge (\tau(\omega)-1)}(\omega)$. Note that $M^{\tau-1}_{n_i+m}(\cdot)$ is not a
martingale difference. However, it is a well defined $\mathcal{F}_{n_i+m}$-measurable random
variable. Writing (9) for the iterates prior to stopping gives us

$$x^{\tau-1}_{n_i+m_2} = x^{\tau-1}_{n_i+m_1} + \sum_{j=n_i+m_1}^{n_i+m_2-1} a(j)\, h(x_j)\, I\{j + 1 < \tau\} + \sum_{j=n_i+m_1}^{n_i+m_2-1} a(j)\, M^{\tau-1}_{j+1}. \tag{10}$$

For $k \ge 0$ define stopping times $\tau_k$ by

$$\tau_k(\omega) := \inf\{n_i + m : \|x^{\tau-1}_{n_i+m}(\omega)\| > k\} \wedge n_{i+1}.$$

By (10),

$$x^{\tau-1}_{\tau_k} = x^{\tau-1}_{\tau_{k-1}} + \sum_{j=\tau_{k-1}}^{\tau_k-1} a(j)\, h(x_j)\, I\{j + 1 < \tau\} + \sum_{j=\tau_{k-1}}^{\tau_k-1} a(j)\, M^{\tau-1}_{j+1}.$$

From the definition of $\tau$ it follows that

$$\sup_{m_1, m_2} \Big\| \sum_{m=m_1}^{m_2} a(n_i + m)\, M^{\tau-1}_{n_i+m+1} \Big\| < 2\delta,$$

whenever $m_1$ and $m_2$ are such that $n_i \le n_i + m_1 \le n_i + m_2 < n_{i+1}$. Further,
since $h(\cdot)$ is Lipschitz, we have $\|h(x)\| \le c\,(1 + \|x\|)$ for some $0 < c < \infty$.
Combining the two gives

$$\|x^{\tau-1}_{\tau_k}\| \le \|x^{\tau-1}_{\tau_{k-1}}\| + \sum_{j=\tau_{k-1}}^{\tau_k-1} a(j)\, c\,(1 + \|x^{\tau-1}_j\|) + 2\delta \le \|x^{\tau-1}_{\tau_{k-1}}\| + \sum_{j=\tau_{k-1}}^{\tau_k-1} a(j)\, c\,(1 + k) + 2\delta.$$

Assume, without loss of generality, that $\delta < 1/6$. If it isn't, simply
replace it by some constant which is less than $1/6$. Recall that $a_{\max} < \infty$
where $a_{\max} = \sup_n a(n)$. Choose an $N$ such that

$$\sum_{k=K_0}^{N} \frac{1 - 5\delta}{c\,(1 + k)} > T + a_{\max}.$$

Let $n_0$ be large enough so that $a(n_0)\, c\,(1 + N) < \delta$. Assume that $\|x^{\tau-1}_{n_i+m}\|$
crosses the interval $[k-1, k]$ from below $k - 1$ to above $k$ as $m$ ranges from
$0$ to $n_{i+1} - n_i - 1$. As long as $k \le N$ it will always be the case that $\|x^{\tau-1}_{\tau_{k-1}}\|$
lies in the range $[k-1, k-1+3\delta)$, because a single step changes the norm by
less than $\delta + 2\delta$. We therefore get

$$\sum_{j=\tau_{k-1}}^{\tau_k-1} a(j) \ge \frac{\|x^{\tau-1}_{\tau_k}\| - \|x^{\tau-1}_{\tau_{k-1}}\| - 2\delta}{c\,(1 + k)} > \frac{1 - 5\delta}{c\,(1 + k)},$$

as long as $k \le N$ and $\|x^{\tau-1}_{n_i+m}\|$ crosses the interval $[k-1, k]$ from below $k - 1$
to above $k$. Since $\sum_k \sum_{j=\tau_{k-1}}^{\tau_k-1} a(j)$ can never exceed $T + a_{\max}$, and since
$\sum_{k=K_0}^{N} \frac{1-5\delta}{c\,(1+k)} > T + a_{\max}$, it follows that $N$ is an upper bound for $\|x^{\tau-1}_{n_i+m}\|$.

To summarize:

Lemma 7. There exists a constant $N$ such that for all $i$, conditional on $\mathcal{B}_{i-1}$,
and all $m \le n_{i+1} - n_i$, the following holds

$$\|x^{\tau-1}_{n_i+m}\| < N.$$



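5.2.2 Bound for $P\left[|\mathcal{P}_z(M^{\tau}_{n_i+m+1})| > u \mid \mathcal{B}_{i-1}\right]$

Lemma 8. There exist constants $K_1$ and $K_2$ such that, for $u$ sufficiently
large, the following holds

$$P\left[|\mathcal{P}_z(M^{\tau}_{n_i+m+1})| > u \mid \mathcal{B}_{i-1}\right] \le K_1 \exp(-K_2 u).$$

Proof. Using first the tail probability bound (5), and then Lemma 7 we get,
for $u$ sufficiently large, the following bound

$$P\left[|\mathcal{P}_z(M^{\tau}_{n_i+m+1})| > u \mid \mathcal{B}_{i-1}\right] \le P\left[\|M^{\tau}_{n_i+m+1}\| > u \mid \mathcal{B}_{i-1}\right]$$
$$\le c_1 \exp\left(-c_2 u / (1 + \|x^{\tau-1}_{n_i+m}\|)\right)$$
$$\le c_1 \exp\left(-c_2 u / (1 + N)\right).$$ □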



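5.2.3 Bound for $P\left[|\eta| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right]$

For $0 \le m < n_{i+1} - n_i$, define $Y_{n_i+m+1}$ as

$$Y_{n_i+m+1} := N_{n_i+m+1} - E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}].$$

Note that $Y_{n_i+1}, Y_{n_i+2}, \ldots, Y_{n_{i+1}}$ is a martingale difference sequence and,
consequently, $\sum_{m=0}^{j-1} a(n_i + m)\, Y_{n_i+m+1}$ is a martingale. We can write $\eta$ as

$$\eta = \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, Y_{n_i+m+1} + \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}]. \tag{11}$$

Note that $\mathcal{P}_z(M^{\tau}_{n_i+m+1})$ is a martingale difference for $0 \le m < n_{i+1} - n_i$.
Using Lemma 8 this gives us, for $v$ sufficiently large, the following bound

$$E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}] \le \int_{v}^{\infty} K_1 \exp(-K_2 u)\, du = c_1 \exp(-c v).$$

A similar calculation shows

$$E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}] \ge -c_1 \exp(-c v),$$

and consequently, for $0 \le m < n_{i+1} - n_i$,

$$\left|E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}]\right| \le c_1 \exp(-c v).$$

Combining everything gives

$$\Big| \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}] \Big| \le (T + 1)\, c_1 \exp(-c v).$$

Note that the last expression can be made as small as desired by choosing
$v$ sufficiently large. Choose $v = \sqrt[3]{\delta^2 / a(n_i)}$. The reason for this specific choice
will become clear later. It follows that for $n_0$ large enough, $v$ will indeed be
as large as required. Assume $n_0$ to be sufficiently large that

$$\Big| \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, E[N_{n_i+m+1} \mid \mathcal{F}_{n_i+m}] \Big| < \frac{\delta}{2\sqrt{d}}. \tag{12}$$

Using (11) and (12), we get, for $n_0$ sufficiently large, the following

$$P\left[|\eta| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right] \le P\left[\Big| \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, Y_{n_i+m+1} \Big| > \frac{\delta}{2\sqrt{d}} \;\Big|\; \mathcal{B}_{i-1}\right].$$

We recall the Azuma-Hoeffding inequality for martingales that have bounded
differences. Suppose $\{X_k : k = 0, 1, 2, \ldots\}$ is a martingale and the differences
satisfy $|X_k - X_{k-1}| \le c_k$ a.s. Then for all positive integers $n$ and all positive
reals $t$, $P(X_n - X_0 \ge t) \le \exp\left(-t^2 / (2 \sum_{k=1}^{n} c_k^2)\right)$. We'll use it in the two sided form

$$P\left[|X_n - X_0| \ge t\right] \le 2 \exp\left(\frac{-t^2}{2 \sum_{k=1}^{n} c_k^2}\right).$$

Note that $|Y_{n_i+m+1}| \le 2v$ and $a(n_i + m) \le \gamma_T\, a(n_i)$. Also, $n_{i+1} - n_i \le
\gamma_T (T + a_{\max}) / a(n_i)$. This gives, for $n_0$ large enough to satisfy (12), the following
bound

$$P\left[\Big| \sum_{m=0}^{n_{i+1}-n_i-1} a(n_i + m)\, Y_{n_i+m+1} \Big| > \frac{\delta}{2\sqrt{d}} \;\Big|\; \mathcal{B}_{i-1}\right] \le 2 \exp\left(-\frac{c\,\delta^2}{a(n_i)\, v^2}\right). \tag{13}$$

To summarize:

Lemma 9. For $n_0$ sufficiently large, there exists a constant $c$ such that

$$P\left[|\eta| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right] \le 2 \exp\left(-\frac{c\,\delta^2}{a(n_i)\, v^2}\right),$$

where $v = v(a(n_i))$ is such that $v(a(n_i)) \uparrow \infty$ as $a(n_i) \downarrow 0$.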
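The two sided Azuma-Hoeffding bound recalled in this subsection can be compared with an exact tail probability in the simplest bounded-difference case. The sketch below uses a sum of independent fair $\pm 1$ steps, so that $c_k = 1$; this martingale, the sample sizes and the function names are illustrative assumptions and are not the martingale analyzed in the paper.

```python
import math

def azuma_two_sided(t, c):
    """Two sided Azuma-Hoeffding bound 2 * exp(-t^2 / (2 * sum_k c_k^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * sum(ck * ck for ck in c)))

def rademacher_tail(n, t):
    """Exact P(|S_n| >= t) for S_n a sum of n independent fair +-1 steps."""
    favourable = 0
    for heads in range(n + 1):
        if abs(2 * heads - n) >= t:     # S_n = heads - tails = 2 * heads - n
            favourable += math.comb(n, heads)
    return favourable / 2 ** n

n = 100
thresholds = (10, 20, 30)
exact = [rademacher_tail(n, t) for t in thresholds]
bound = [azuma_two_sided(t, [1.0] * n) for t in thresholds]
```

The exact tail never exceeds the bound, and both decay rapidly in $t$; in the proof the same inequality is applied with $c_k$ proportional to $a(n_i)\,v$.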



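5.2.4 Bound for $P\left[|\mathcal{P}_z(\zeta^{\tau}_{n_{i+1}})| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right]$

We have

$$P\left[\exists\, m < n_{i+1} - n_i : \mathcal{P}_z(M^{\tau}_{n_i+m+1}) \ne N_{n_i+m+1} \mid \mathcal{B}_{i-1}\right] \le \frac{\gamma_T (T + a_{\max})}{a(n_i)} \times K_1 \exp(-K_2 v) = \frac{c_1}{a(n_i)} \exp(-c_2 v). \tag{14}$$

Plugging (14) in (8), and applying Lemma 9 we get

$$P\left[|\mathcal{P}_z(\zeta^{\tau}_{n_{i+1}})| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right] \le 2 \exp\left(-\frac{c\,\delta^2}{a(n_i)\, v^2}\right) + \frac{c_1}{a(n_i)} \exp(-c_2 v).$$

Since the left hand side is independent of $v$ we can choose a value for $v$
which keeps the right hand side sufficiently low. Specifically, we choose

$$v = \sqrt[3]{\delta^2 / a(n_i)}.$$

This gives us the following bound

$$P\left[|\mathcal{P}_z(\zeta^{\tau}_{n_{i+1}})| > \delta/\sqrt{d} \mid \mathcal{B}_{i-1}\right] \le \frac{c_1}{a(n_i)} \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{a(n_i)}}\right) \le c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{a(n_i)}}\right)$$

for $n_0$ large enough.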
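The specific choice of $v$ in this subsection balances the two exponents appearing on the right hand side: $\delta^2/(a(n_i)\,v^2)$ from the truncated martingale and $v$ from the truncation error. A short Python sketch, with constants dropped and with hypothetical function names and sample values, confirms that both exponents coincide with $\delta^{2/3}/\sqrt[3]{a(n_i)}$ at $v = \sqrt[3]{\delta^2/a(n_i)}$.

```python
import math

def balanced_v(delta, a):
    """Truncation level v = (delta^2 / a)^(1/3)."""
    return (delta ** 2 / a) ** (1.0 / 3.0)

def exponents(delta, a, v):
    """Return (delta^2 / (a v^2), v), the two exponents with constants dropped."""
    return delta ** 2 / (a * v ** 2), v

delta, a_ni = 0.1, 1e-4                      # illustrative values only
v = balanced_v(delta, a_ni)
azuma_exponent, truncation_exponent = exponents(delta, a_ni, v)
target = delta ** (2.0 / 3.0) / a_ni ** (1.0 / 3.0)
```

A larger $v$ would weaken the first exponent and a smaller $v$ the second, so this choice gives the best common decay rate obtainable from the two terms.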
5.2.5 Bound for $\sum_i P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \mid \mathcal{B}_{i-1}\right]$

Plugging the last bound in (7) we get

$$P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right] \le c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{a(n_i)}}\right),$$

where we have absorbed the multiplicative factor of $d$ in the constant $c_1$.

We note that $c_1 \exp\left(-c\,\delta^{2/3}/\sqrt[3]{y}\right)$ is a convex function for $y \in (0, c_2)$, where
$c_2$ is a sufficiently small positive constant. Furthermore,

$$c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{y}}\right) \to 0 \quad \text{as } y \downarrow 0.$$

For such functions we have the following fact:

Lemma 10. Let $g(\cdot)$ be a function such that $g(0) = 0$ and $g(\cdot)$ is convex in
the region $(0, c)$ for some $c > 0$. For $a, b > 0$ and $a + b < c$, the following
holds

$$g(a) + g(b) \le g(a + b).$$

Proof. We have

$$g(a) = g\left(\frac{b}{a+b} \cdot 0 + \frac{a}{a+b}\,(a + b)\right) \le \frac{b}{a+b}\, g(0) + \frac{a}{a+b}\, g(a + b) = \frac{a}{a+b}\, g(a + b).$$

Similarly

$$g(b) \le \frac{b}{a+b}\, g(a + b).$$

Adding the two we get

$$g(a) + g(b) \le g(a + b).$$ □

Finally, by Lemma 10 and Remark 6 we get

$$\sum_{i=0}^{\infty} P\left[\max_{0 \le j \le n_{i+1}-n_i} \|\zeta_{n_i+j}\| > \delta \;\Big|\; \mathcal{B}_{i-1}\right] \le c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{b(n_0)}}\right),$$

provided $n_0$ is sufficiently large. Note, in particular, that $n_0$ should be large
enough to ensure that $\sum_i a(n_i)$ lies in the region of convexity of $c_1 \exp\left(-c\,\delta^{2/3}/\sqrt[3]{y}\right)$.

To summarize the calculations of this section, we have proved the following
result:

Theorem 11. Under assumptions (A1)-(A3), the assumption that for large
$u$, the tail probability bound (5) holds, and the step size assumption (6), we
have the following bound provided $n_0$ is sufficiently large

$$P[x_n \to H \mid x_{n_0} \in B] \ge 1 - c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{b(n_0)}}\right),$$

where $\delta = \delta_B / (2 K_T)$.
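The superadditivity property of Lemma 10 can be checked numerically for a function of the type used in this section. The sketch below takes $g(y) = \exp(-k/\sqrt[3]{y})$ with the illustrative constant $k = 1$ and $g(0) = 0$. A direct computation gives $g''(y) = g(y)\,(k/9)\,y^{-8/3}\,(k - 4\,y^{1/3})$, so this $g$ is convex on $(0, (k/4)^3)$; the grid and the names used are assumptions made for the example.

```python
import math

def g(y, k=1.0):
    """g(y) = exp(-k / y^(1/3)) for y > 0, extended by g(0) = 0."""
    return 0.0 if y == 0 else math.exp(-k / y ** (1.0 / 3.0))

convex_limit = (1.0 / 4.0) ** 3     # g is convex on (0, (k/4)^3) when k = 1
grid = [i * 1e-3 for i in range(1, 15)]

# Collect every pair in the convex region that violates g(a) + g(b) <= g(a + b).
violations = [(a, b) for a in grid for b in grid
              if a + b < convex_limit and g(a) + g(b) > g(a + b)]
```

No violations are expected inside the convex region, which is the property that lets the per-window bounds be summed into a single bound in terms of $b(n_0)$.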



6 Application: a sample complexity result

As an application of our result we give here a sample complexity estimate,
which roughly says that conditional on $x_{n_0} \in B$ for some fixed, sufficiently
large $n_0$, with a high probability the interpolated trajectory $\bar{x}(t)$ will be
sufficiently close to $H$ after any lapse of time greater than some fixed $\gamma$. We now
state the result more formally. We briefly sketch how the sample complexity
result follows from an error probability bound. For a fuller description see
[3], p. 42.

Fix an $\epsilon > 0$ such that $H^{\epsilon} := \{x : V(x) < \epsilon\} \subset \bar{H}^{\epsilon} := \{x : V(x) \le \epsilon\} \subset B$.
Since $\bar{B} \setminus H^{\epsilon}$ is compact, $V(\cdot)$ is continuous, and the o.d.e. $\dot{x}(t) = h(x(t))$
is well-posed, it follows that there is a strictly positive $\Delta$ such that if the
o.d.e. starts from $x \in \bar{B} \setminus H^{\epsilon}$, flows for any time greater than $T$ and reaches
$y$, then $V(y) < V(x) - \Delta$.

Let $N_{\delta}(\cdot)$ denote a $\delta$-neighborhood of its argument. Fix $\delta$ such that
$N_{\delta}(\bar{H}^{\epsilon}) \subset B$, and for all $x, y \in \bar{B}$ with $\|x - y\| < \delta$, we have $|V(x) - V(y)| <
\Delta/2$. We can do so since $V(\cdot)$ is continuous and $\bar{B}$ compact.

We assume that $x_{n_0} \in B$. Further, assuming that $\rho_i < \delta$ for all $i$, we derive
an estimate $\gamma$ for the time in which iterates, if they start with $x_{n_0} \in B \setminus H^{\epsilon}$,
will get trapped in $N_{\delta}(H^{\epsilon + \Delta/2})$ except for a small error probability given by

$$\sum_i P[\rho_i \ge \delta \mid \mathcal{B}_{i-1}].$$

The iterates, while they are in $B \setminus H^{\epsilon}$, would lose a minimum of $\Delta$ from
their potential if they could exactly follow the o.d.e. for time $T$. As $\rho_i < \delta$ $\forall i$,
over time $T$, they deviate up to $\delta$ from the o.d.e. However since a $\delta$ shift can
change the potential only by $\Delta/2$, they are still guaranteed a loss of potential
of $\Delta/2$. They can continue losing $\Delta/2$ over every lapse of time $T$ until
$x_{n_i} \in H^{\epsilon}$ for some $i$. Thereafter the 'boundary iterates' $x_{n_j}$, $j \ge i$, remain
trapped in $H^{\epsilon + \Delta/2}$, since, if $x_{n_j} \in H^{\epsilon}$ then even with the worst possible
'throwing out' $x_{n_{j+1}} \in H^{\epsilon + \Delta/2}$. It follows that for $j \ge i$ the intermediate
iterates $x_{n_j+m}$, $m < n_{j+1} - n_j$, remain trapped in $N_{\delta}(H^{\epsilon + \Delta/2})$. Thus we get
the following estimate for $\gamma$:

$$\gamma = \frac{\max_{x \in \bar{B}} V(x) - \epsilon}{\Delta/2} \times (T + 1),$$

leading to the following sample complexity estimate

Theorem 12. Under assumptions (A1)-(A3), the step size assumption (6),
and the assumption that for large $u$, the tail probability bound (5) holds, we
have the following bound provided $n_0$ is sufficiently large

$$P\left[\bar{x}(t) \in N_{\delta}(H^{\epsilon + \Delta/2}) \ \forall t \ge t_0 + \gamma \mid x_{n_0} \in B\right] \ge 1 - c_1 \exp\left(-\frac{c\,\delta^{2/3}}{\sqrt[3]{b(n_0)}}\right),$$

where $\delta = \delta_B / (2 K_T)$.
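The time estimate $\gamma$ in this section is simple arithmetic once the problem constants are known. The following Python sketch evaluates it; the function name `settling_time`, the argument names and the numerical values are hypothetical and serve only to show how the formula is read.

```python
def settling_time(v_max, eps, drop, T):
    """gamma = (v_max - eps) / (drop / 2) * (T + 1).

    v_max : maximum of V over the closure of B (assumed known)
    eps   : level defining the target sublevel set of V
    drop  : guaranteed o.d.e. decrease Delta of V over a time span T
    T     : length of one time window
    """
    return (v_max - eps) / (drop / 2.0) * (T + 1.0)

# Illustrative numbers: 36 windows are needed, each lasting at most T + 1.
gamma = settling_time(v_max=10.0, eps=1.0, drop=0.5, T=2.0)
```

The first factor counts how many windows are needed to reduce the potential from its maximum to $\epsilon$ in decrements of $\Delta/2$, and the factor $T + 1$ bounds the algorithmic time spent in each window.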